## Introduction

This project has three tasks:

1. Identify the factors that lead to attrition, including the top three factors that contribute to turnover.

2. Identify any job-role-specific trends in the data set, which the executive leadership is also interested in, and report any other interesting trends and observations from the analysis.

3. Build models to predict attrition and salary.

# Import and tidying datasets

Import data CaseStudy2-data.csv

Import data CaseStudy2CompSet No Attrition.csv

Import data CaseStudy2CompSet No Salary.xlsx

Data set overview

As imported, the training set has 870 observations of 36 variables: 9 character columns and 27 numeric columns. There are no missing values in the data set.

##        ID             Age         Attrition         BusinessTravel    
##  Min.   :  1.0   Min.   :18.00   Length:870         Length:870        
##  1st Qu.:218.2   1st Qu.:30.00   Class :character   Class :character  
##  Median :435.5   Median :35.00   Mode  :character   Mode  :character  
##  Mean   :435.5   Mean   :36.83                                        
##  3rd Qu.:652.8   3rd Qu.:43.00                                        
##  Max.   :870.0   Max.   :60.00                                        
##    DailyRate       Department        DistanceFromHome   Education    
##  Min.   : 103.0   Length:870         Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 472.5   Class :character   1st Qu.: 2.000   1st Qu.:2.000  
##  Median : 817.5   Mode  :character   Median : 7.000   Median :3.000  
##  Mean   : 815.2                      Mean   : 9.339   Mean   :2.901  
##  3rd Qu.:1165.8                      3rd Qu.:14.000   3rd Qu.:4.000  
##  Max.   :1499.0                      Max.   :29.000   Max.   :5.000  
##  EducationField     EmployeeCount EmployeeNumber   EnvironmentSatisfaction
##  Length:870         Min.   :1     Min.   :   1.0   Min.   :1.000          
##  Class :character   1st Qu.:1     1st Qu.: 477.2   1st Qu.:2.000          
##  Mode  :character   Median :1     Median :1039.0   Median :3.000          
##                     Mean   :1     Mean   :1029.8   Mean   :2.701          
##                     3rd Qu.:1     3rd Qu.:1561.5   3rd Qu.:4.000          
##                     Max.   :1     Max.   :2064.0   Max.   :4.000          
##     Gender            HourlyRate     JobInvolvement     JobLevel    
##  Length:870         Min.   : 30.00   Min.   :1.000   Min.   :1.000  
##  Class :character   1st Qu.: 48.00   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Median : 66.00   Median :3.000   Median :2.000  
##                     Mean   : 65.61   Mean   :2.723   Mean   :2.039  
##                     3rd Qu.: 83.00   3rd Qu.:3.000   3rd Qu.:3.000  
##                     Max.   :100.00   Max.   :4.000   Max.   :5.000  
##    JobRole          JobSatisfaction MaritalStatus      MonthlyIncome  
##  Length:870         Min.   :1.000   Length:870         Min.   : 1081  
##  Class :character   1st Qu.:2.000   Class :character   1st Qu.: 2840  
##  Mode  :character   Median :3.000   Mode  :character   Median : 4946  
##                     Mean   :2.709                      Mean   : 6390  
##                     3rd Qu.:4.000                      3rd Qu.: 8182  
##                     Max.   :4.000                      Max.   :19999  
##   MonthlyRate    NumCompaniesWorked    Over18            OverTime        
##  Min.   : 2094   Min.   :0.000      Length:870         Length:870        
##  1st Qu.: 8092   1st Qu.:1.000      Class :character   Class :character  
##  Median :14074   Median :2.000      Mode  :character   Mode  :character  
##  Mean   :14326   Mean   :2.728                                           
##  3rd Qu.:20456   3rd Qu.:4.000                                           
##  Max.   :26997   Max.   :9.000                                           
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
##  Min.   :11.0      Min.   :3.000     Min.   :1.000            Min.   :80   
##  1st Qu.:12.0      1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80   
##  Median :14.0      Median :3.000     Median :3.000            Median :80   
##  Mean   :15.2      Mean   :3.152     Mean   :2.707            Mean   :80   
##  3rd Qu.:18.0      3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80   
##  Max.   :25.0      Max.   :4.000     Max.   :4.000            Max.   :80   
##  StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
##  Min.   :0.0000   Min.   : 0.00     Min.   :0.000         Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000  
##  Median :1.0000   Median :10.00     Median :3.000         Median :3.000  
##  Mean   :0.7839   Mean   :11.05     Mean   :2.832         Mean   :2.782  
##  3rd Qu.:1.0000   3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000  
##  Max.   :3.0000   Max.   :40.00     Max.   :6.000         Max.   :4.000  
##  YearsAtCompany   YearsInCurrentRole YearsSinceLastPromotion
##  Min.   : 0.000   Min.   : 0.000     Min.   : 0.000         
##  1st Qu.: 3.000   1st Qu.: 2.000     1st Qu.: 0.000         
##  Median : 5.000   Median : 3.000     Median : 1.000         
##  Mean   : 6.962   Mean   : 4.205     Mean   : 2.169         
##  3rd Qu.:10.000   3rd Qu.: 7.000     3rd Qu.: 3.000         
##  Max.   :40.000   Max.   :18.000     Max.   :15.000         
##  YearsWithCurrManager
##  Min.   : 0.00       
##  1st Qu.: 2.00       
##  Median : 3.00       
##  Mean   : 4.14       
##  3rd Qu.: 7.00       
##  Max.   :17.00
## 'data.frame':    870 obs. of  36 variables:
##  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                     : int  32 40 35 32 24 27 41 37 34 34 ...
##  $ Attrition               : chr  "No" "No" "No" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
##  $ DailyRate               : int  117 1308 200 801 567 294 1283 309 1333 653 ...
##  $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Sales" ...
##  $ DistanceFromHome        : int  13 14 18 1 2 10 5 10 10 10 ...
##  $ Education               : int  4 3 2 4 1 2 5 4 4 4 ...
##  $ EducationField          : chr  "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
##  $ EnvironmentSatisfaction : int  2 3 3 3 1 4 2 4 3 4 ...
##  $ Gender                  : chr  "Male" "Male" "Male" "Female" ...
##  $ HourlyRate              : int  73 44 60 48 32 32 90 88 87 92 ...
##  $ JobInvolvement          : int  3 2 3 3 3 3 4 2 3 2 ...
##  $ JobLevel                : int  2 5 3 3 1 3 1 2 1 2 ...
##  $ JobRole                 : chr  "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
##  $ JobSatisfaction         : int  4 3 4 4 4 1 3 4 3 3 ...
##  $ MaritalStatus           : chr  "Divorced" "Single" "Single" "Married" ...
##  $ MonthlyIncome           : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ MonthlyRate             : int  9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
##  $ NumCompaniesWorked      : int  2 1 2 1 1 1 2 2 1 1 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "No" "No" "No" "No" ...
##  $ PercentSalaryHike       : int  11 14 11 19 13 21 12 14 19 14 ...
##  $ PerformanceRating       : int  3 3 3 3 3 4 3 3 3 3 ...
##  $ RelationshipSatisfaction: int  3 1 3 3 3 3 1 3 4 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  1 0 0 2 0 2 0 3 1 1 ...
##  $ TotalWorkingYears       : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ TrainingTimesLastYear   : int  3 2 2 3 2 4 5 5 2 3 ...
##  $ WorkLifeBalance         : int  2 4 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ YearsInCurrentRole      : int  2 7 2 10 3 7 2 0 1 2 ...
##  $ YearsSinceLastPromotion : int  0 4 2 5 1 1 0 0 0 7 ...
##  $ YearsWithCurrManager    : int  3 9 2 7 3 7 3 0 0 7 ...
Data summary

| | |
|---|---|
| Name | training_data |
| Number of rows | 870 |
| Number of columns | 36 |
| Column type frequency: | |
| character | 9 |
| numeric | 27 |
| Group variables | None |

Variable type: character

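| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Attrition | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| BusinessTravel | 0 | 1 | 10 | 17 | 0 | 3 | 0 |
| Department | 0 | 1 | 5 | 22 | 0 | 3 | 0 |
| EducationField | 0 | 1 | 5 | 16 | 0 | 6 | 0 |
| Gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| JobRole | 0 | 1 | 7 | 25 | 0 | 9 | 0 |
| MaritalStatus | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| Over18 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| OverTime | 0 | 1 | 2 | 3 | 0 | 2 | 0 |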

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 435.50 | 251.29 | 1 | 218.25 | 435.5 | 652.75 | 870 | ▇▇▇▇▇ |
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.00 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.50 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
| EmployeeNumber | 0 | 1 | 1029.83 | 604.79 | 1 | 477.25 | 1039.0 | 1561.50 | 2064 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.00 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobInvolvement | 0 | 1 | 2.72 | 0.70 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.71 | 1.11 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.50 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.00 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80.00 | 80.0 | 80.00 | 80 | ▁▁▇▁▁ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.78 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.00 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |

## Removing unnecessary columns from training set and setting all categorical variables to be factors
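A minimal sketch of this tidying step, assuming that "unnecessary" means columns with a single unique value (the summaries above show `Over18`, `EmployeeCount`, and `StandardHours` are constant). The toy data frame and the name `tidy_data` are illustrative only; the exact set of columns dropped in the analysis is not shown in this report.

```r
# Toy stand-in for training_data (illustrative values, not the real 870 rows).
training_data <- data.frame(
  ID = 1:6,
  Attrition = c("No", "No", "Yes", "No", "Yes", "No"),
  OverTime = c("No", "Yes", "Yes", "No", "No", "No"),
  Over18 = rep("Y", 6),
  EmployeeCount = rep(1L, 6),
  StandardHours = rep(80L, 6),
  JobLevel = c(2L, 5L, 3L, 3L, 1L, 3L),
  MonthlyIncome = c(4403, 19626, 9362, 10422, 3760, 8793),
  stringsAsFactors = FALSE
)

# A column with a single unique value carries no information for modelling.
is_constant <- vapply(training_data, function(x) length(unique(x)) == 1, logical(1))
tidy_data <- training_data[, !is_constant, drop = FALSE]

# Convert every remaining character column to a factor.
is_char <- vapply(tidy_data, is.character, logical(1))
tidy_data[is_char] <- lapply(tidy_data[is_char], factor)

# Assumption: JobLevel is treated as categorical, which matches the
# JobLevel2..JobLevel5 coefficients in the regression output later on.
tidy_data$JobLevel <- factor(tidy_data$JobLevel)

str(tidy_data)
```

Detecting constant columns programmatically avoids hard-coding names, but identifier columns such as `ID` still need to be excluded by hand before modelling.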

Data visualization

Attrition By Department

##   Attrition   n
## 1        No  29
## 2       Yes   6
## 3        No 487
## 4       Yes  75
## 5        No 214
## 6       Yes  59
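The counts above are easier to compare as rates. The sketch below recomputes the attrition rate per group from the printed counts. The department labels are not included in the printed output, so the groups are labelled by row order (`group 1` to `group 3`) rather than by name; those labels are placeholders.

```r
# Counts copied from the output above; group labels are placeholders because
# the Department column is missing from the printed table.
counts <- data.frame(
  group = rep(c("group 1", "group 2", "group 3"), each = 2),
  Attrition = rep(c("No", "Yes"), times = 3),
  n = c(29, 6, 487, 75, 214, 59)
)

# Total head count per group.
totals <- tapply(counts$n, counts$group, sum)

# Number of employees who left, named by group.
left <- counts$n[counts$Attrition == "Yes"]
names(left) <- counts$group[counts$Attrition == "Yes"]

# Attrition rate = leavers / total, per group.
attrition_rate <- left / totals[names(left)]
print(round(attrition_rate, 3))
```

The rates come out at roughly 17%, 13%, and 22% for the three groups, so the raw counts alone (6, 75, and 59 leavers) give a misleading ranking.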

Attrition VS Age

There seems to be a quadratic trend: attrition is high in the late teens and early 20s, levels off in the 30s, and starts picking back up in the 50s.

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

## Attrition VS JobSatisfaction

There appears to be a strong relationship between JobSatisfaction and attrition rate: the greater the job satisfaction, the lower the likelihood of attrition.

## Attrition VS total working years

Similar to age, the likelihood of attrition appears to decrease as total working years increase.

## Warning: Removed 10 rows containing missing values (geom_point).

## Attrition VS Job Role

Sales representatives appear to have a much higher attrition rate.

## Attrition VS PercentSalaryHike

There's a very small correlation between percent salary hike and attrition.

## Attrition VS hourly rate

There doesn't appear to be any real correlation between hourly rate and attrition.

## Warning: Removed 12 rows containing missing values (geom_point).

## Attrition VS OverTime

Working overtime appears to have a significant impact on attrition rate.

## Attrition VS Monthly Income

## Warning: Removed 813 rows containing missing values (geom_point).

## Testing Bayes models

Test Bayes models using the factors that had the most impact on attrition (Age, JobSatisfaction, JobRole, TotalWorkingYears, and HourlyRate), then find the best model.

## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  225   2
##   Yes  32   2
##                                           
##                Accuracy : 0.8697          
##                  95% CI : (0.8227, 0.9081)
##     No Information Rate : 0.9847          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.08            
##                                           
##  Mcnemar's Test P-Value : 6.577e-07       
##                                           
##             Sensitivity : 0.87549         
##             Specificity : 0.50000         
##          Pos Pred Value : 0.99119         
##          Neg Pred Value : 0.05882         
##              Prevalence : 0.98467         
##          Detection Rate : 0.86207         
##    Detection Prevalence : 0.86973         
##       Balanced Accuracy : 0.68774         
##                                           
##        'Positive' Class : No              
## 
## [1] 0.842069
## [1] 0.002061901
## [1] 0.8508076
## [1] 0.002113658
## [1] 0.6040192
## [1] 0.002113658
## [1] 0.8407663
## [1] 0.00207462
## [1] 0.8507683
## [1] 0.002080485
## [1] 0.5916438
## [1] 0.002080485
## [1] 0.849387
## [1] 0.001940469
## [1] 0.8587652
## [1] 0.002087962
## [1] 0.6427007
## [1] 0.002087962
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  214   4
##   Yes  38   5
##                                           
##                Accuracy : 0.8391          
##                  95% CI : (0.7888, 0.8815)
##     No Information Rate : 0.9655          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1435          
##                                           
##  Mcnemar's Test P-Value : 3.543e-07       
##                                           
##             Sensitivity : 0.8492          
##             Specificity : 0.5556          
##          Pos Pred Value : 0.9817          
##          Neg Pred Value : 0.1163          
##              Prevalence : 0.9655          
##          Detection Rate : 0.8199          
##    Detection Prevalence : 0.8352          
##       Balanced Accuracy : 0.7024          
##                                           
##        'Positive' Class : No              
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  211   7
##   Yes  37   6
##                                           
##                Accuracy : 0.8314          
##                  95% CI : (0.7804, 0.8748)
##     No Information Rate : 0.9502          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1492          
##                                           
##  Mcnemar's Test P-Value : 1.232e-05       
##                                           
##             Sensitivity : 0.8508          
##             Specificity : 0.4615          
##          Pos Pred Value : 0.9679          
##          Neg Pred Value : 0.1395          
##              Prevalence : 0.9502          
##          Detection Rate : 0.8084          
##    Detection Prevalence : 0.8352          
##       Balanced Accuracy : 0.6562          
##                                           
##        'Positive' Class : No              
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  226   1
##   Yes  34   0
##                                           
##                Accuracy : 0.8659          
##                  95% CI : (0.8185, 0.9048)
##     No Information Rate : 0.9962          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0075         
##                                           
##  Mcnemar's Test P-Value : 6.338e-08       
##                                           
##             Sensitivity : 0.8692          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9956          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.9962          
##          Detection Rate : 0.8659          
##    Detection Prevalence : 0.8697          
##       Balanced Accuracy : 0.4346          
##                                           
##        'Positive' Class : No              
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  223   4
##   Yes  29   5
##                                          
##                Accuracy : 0.8736         
##                  95% CI : (0.827, 0.9113)
##     No Information Rate : 0.9655         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1883         
##                                          
##  Mcnemar's Test P-Value : 2.943e-05      
##                                          
##             Sensitivity : 0.8849         
##             Specificity : 0.5556         
##          Pos Pred Value : 0.9824         
##          Neg Pred Value : 0.1471         
##              Prevalence : 0.9655         
##          Detection Rate : 0.8544         
##    Detection Prevalence : 0.8697         
##       Balanced Accuracy : 0.7202         
##                                          
##        'Positive' Class : No             
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  224   3
##   Yes  30   4
##                                          
##                Accuracy : 0.8736         
##                  95% CI : (0.827, 0.9113)
##     No Information Rate : 0.9732         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1577         
##                                          
##  Mcnemar's Test P-Value : 6.011e-06      
##                                          
##             Sensitivity : 0.8819         
##             Specificity : 0.5714         
##          Pos Pred Value : 0.9868         
##          Neg Pred Value : 0.1176         
##              Prevalence : 0.9732         
##          Detection Rate : 0.8582         
##    Detection Prevalence : 0.8697         
##       Balanced Accuracy : 0.7267         
##                                          
##        'Positive' Class : No             
## 

The model below is the better one, with accuracy 0.8697, sensitivity 0.8740, and specificity 0.7143.

## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  222   2
##   Yes  32   5
##                                           
##                Accuracy : 0.8697          
##                  95% CI : (0.8227, 0.9081)
##     No Information Rate : 0.9732          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1908          
##                                           
##  Mcnemar's Test P-Value : 6.577e-07       
##                                           
##             Sensitivity : 0.8740          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.9911          
##          Neg Pred Value : 0.1351          
##              Prevalence : 0.9732          
##          Detection Rate : 0.8506          
##    Detection Prevalence : 0.8582          
##       Balanced Accuracy : 0.7942          
##                                           
##        'Positive' Class : No              
## 
## [1] 0.8406897
## [1] 0.002048626
## [1] 0.8509524
## [1] 0.002089431
## [1] 0.57255
## [1] 0.002089431

The best Bayes model includes Age, JobRole, JobSatisfaction, and OverTime. It has a mean accuracy of 0.849, a mean sensitivity of 0.859, and a mean specificity of 0.643.

## [1] 0.849387
## [1] 0.001940469
## [1] 0.8587652
## [1] 0.002087962
## [1] 0.6427007
## [1] 0.002087962

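## Comparing against KNN model

The KNN model gives mean(AccHolder) = 0.8330268 with sd(AccHolder)/sqrt(100) = 0.00218046, mean(SensHolder) = 0.8517895 with sd(SensHolder)/sqrt(100) = 0.002144115, and mean(SpecHolder) = 0.4511064. The standard error printed for specificity is computed from SensHolder rather than SpecHolder, so it repeats the sensitivity value (0.002144115). All three means are below those of the best Bayes model (0.849387, 0.8587652, and 0.6427007).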

## integer(0)
## [1] NA
## [1] 0.8330268
## [1] 0.00218046
## [1] 0.8517895
## [1] 0.002144115
## [1] 0.4511064
## [1] 0.002144115
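The mean and standard-error figures reported for each model can be produced by repeating a random train/test split and storing the metrics from each run. The sketch below shows one way to do that; it is not the code behind the numbers above. The data are simulated, a logistic regression (`glm`) stands in for the Bayes and KNN classifiers so that the example needs only base R, and the 70/30 split is inferred from the 261-row confusion matrices (261 of 870). The holder names follow those in the KNN section; `std_error` is a hypothetical helper.

```r
set.seed(1)
n <- 400

# Simulated stand-in data (NOT the case-study data).
toy <- data.frame(
  Age = round(runif(n, 18, 60)),
  OverTime = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
p_leave <- plogis(-1.5 - 0.04 * (toy$Age - 36) + 1.2 * (toy$OverTime == "Yes"))
toy$Attrition <- factor(ifelse(runif(n) < p_leave, "Yes", "No"), levels = c("No", "Yes"))

iterations <- 100
AccHolder <- SensHolder <- SpecHolder <- numeric(iterations)

for (i in seq_len(iterations)) {
  # Random 70/30 train/test split.
  train_idx <- sample(n, size = round(0.7 * n))
  train <- toy[train_idx, ]
  test <- toy[-train_idx, ]

  # Stand-in classifier; swap in the model under evaluation here.
  fit <- glm(Attrition ~ Age + OverTime, data = train, family = binomial)
  prob_yes <- predict(fit, newdata = test, type = "response")
  pred <- factor(ifelse(prob_yes > 0.5, "Yes", "No"), levels = c("No", "Yes"))

  cm <- table(Predicted = pred, Actual = test$Attrition)
  AccHolder[i] <- sum(diag(cm)) / sum(cm)
  # "No" is the positive class, as in the confusion matrices in this report.
  SensHolder[i] <- cm["No", "No"] / sum(cm[, "No"])
  SpecHolder[i] <- cm["Yes", "Yes"] / sum(cm[, "Yes"])
}

# Standard error of the mean, computed from the holder it describes.
std_error <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

results <- c(
  accuracy = mean(AccHolder), accuracy_se = std_error(AccHolder),
  sensitivity = mean(SensHolder), sensitivity_se = std_error(SensHolder),
  specificity = mean(SpecHolder, na.rm = TRUE), specificity_se = std_error(SpecHolder)
)
print(round(results, 4))
```

Passing each holder to the same helper avoids reusing the sensitivity vector when reporting the standard error for specificity.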

Attrition prediction with the best model: accuracy 0.849387, sensitivity 0.8587652, and specificity 0.6427007.

## The best Bayes model includes Age, JobRole, JobSatisfaction, and OverTime

## EDA for imputing Monthly Income

JobLevel has the highest correlation with MonthlyIncome (0.952), followed by TotalWorkingYears (0.779), YearsAtCompany (0.491), Age (0.485), and YearsSinceLastPromotion (0.316).
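As a small illustration of how such correlations can be computed, the sketch below uses only the first ten rows of the training data as printed in the `str()` output above. With ten rows the values will differ from the full-data correlations quoted in this section; the names `first_rows` and `income_cor` are illustrative.

```r
# First ten rows of selected columns, copied from the str() output above.
first_rows <- data.frame(
  Age = c(32, 40, 35, 32, 24, 27, 41, 37, 34, 34),
  JobLevel = c(2, 5, 3, 3, 1, 3, 1, 2, 1, 2),
  MonthlyIncome = c(4403, 19626, 9362, 10422, 3760, 8793, 2127, 6694, 2220, 5063),
  TotalWorkingYears = c(8, 21, 10, 14, 6, 9, 7, 8, 1, 8),
  YearsAtCompany = c(5, 20, 2, 14, 6, 9, 4, 1, 1, 8),
  YearsSinceLastPromotion = c(0, 4, 2, 5, 1, 1, 0, 0, 0, 7)
)

# Pearson correlation of each candidate predictor with MonthlyIncome.
predictors <- setdiff(names(first_rows), "MonthlyIncome")
income_cor <- vapply(
  first_rows[predictors],
  function(x) cor(x, first_rows$MonthlyIncome),
  numeric(1)
)

# Strongest correlations first.
income_cor <- sort(income_cor, decreasing = TRUE)
print(round(income_cor, 3))
```

Applied to the full training set, the same call would rank all numeric columns at once, which is a quick way to shortlist predictors before fitting a model.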

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## First model for computing Monthly Income

The first model, which predicts MonthlyIncome from JobLevel, has a reported RMSE of 1410.878 (the run shown below prints 1216.151 after the model summary).

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4642.2  -668.0  -107.3   668.3  4412.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2743.82      69.69   39.37   <2e-16 ***
## JobLevel2    2800.46      99.89   28.04   <2e-16 ***
## JobLevel3    7108.38     130.24   54.58   <2e-16 ***
## JobLevel4   12509.83     177.45   70.50   <2e-16 ***
## JobLevel5   16480.15     219.18   75.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1264 on 865 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9244 
## F-statistic:  2658 on 4 and 865 DF,  p-value: < 2.2e-16
##                 2.5 %    97.5 %
## (Intercept)  2607.044  2880.604
## JobLevel2    2604.402  2996.509
## JobLevel3    6852.766  7363.996
## JobLevel4   12161.551 12858.101
## JobLevel5   16049.957 16910.342
## [1] 1216.151

## 2nd model adding TotalWorkingYears

Adding TotalWorkingYears gave a lower reported error of 1365 (the run shown below prints 1203.668 after the model summary).

## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4957.9  -657.8  -134.6   618.2  4525.8 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2544.901     89.085  28.567  < 2e-16 ***
## JobLevel2          2652.205    107.666  24.634  < 2e-16 ***
## JobLevel3          6820.371    152.732  44.656  < 2e-16 ***
## JobLevel4         11858.212    254.564  46.582  < 2e-16 ***
## JobLevel5         15800.546    289.997  54.485  < 2e-16 ***
## TotalWorkingYears    33.442      9.426   3.548 0.000409 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1256 on 864 degrees of freedom
## Multiple R-squared:  0.9258, Adjusted R-squared:  0.9254 
## F-statistic:  2157 on 5 and 864 DF,  p-value: < 2.2e-16
##                         2.5 %      97.5 %
## (Intercept)        2370.05330  2719.74815
## JobLevel2          2440.88699  2863.52254
## JobLevel3          6520.60145  7120.13957
## JobLevel4         11358.57533 12357.84882
## JobLevel5         15231.36546 16369.72723
## TotalWorkingYears    14.94155    51.94211
## [1] 1203.668
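The report does not show how the error values were computed. As a hedged illustration, the sketch below computes an in-sample RMSE for the first two model formulas, using only the first ten rows printed in the `str()` output above. The helper `rmse` is hypothetical, and the numbers it returns will not match the full-data values.

```r
# First ten rows of the relevant columns, copied from the str() output above.
first_rows <- data.frame(
  JobLevel = factor(c(2, 5, 3, 3, 1, 3, 1, 2, 1, 2)),
  TotalWorkingYears = c(8, 21, 10, 14, 6, 9, 7, 8, 1, 8),
  MonthlyIncome = c(4403, 19626, 9362, 10422, 3760, 8793, 2127, 6694, 2220, 5063)
)

# Hypothetical helper: root mean squared error of a fitted model's residuals.
rmse <- function(model) sqrt(mean(residuals(model)^2))

model_1 <- lm(MonthlyIncome ~ JobLevel, data = first_rows)
model_2 <- lm(MonthlyIncome ~ JobLevel + TotalWorkingYears, data = first_rows)

errors <- c(model_1 = rmse(model_1), model_2 = rmse(model_2))
print(round(errors, 1))
```

An in-sample RMSE cannot increase when a predictor is added to a least-squares fit, so a held-out or cross-validated RMSE is the more informative basis for choosing between these models.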

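## 3rd model adding Age as well

Using the factors most correlated with MonthlyIncome (JobLevel, Age, and TotalWorkingYears) gave an RMSE of around 1200. Note that Age is not significant in this fit (p = 0.4716), and the error printed below (1203.884) is essentially the same as the second model's (1203.668).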

## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + Age + TotalWorkingYears, 
##     data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4947.4  -652.9  -136.8   615.3  4542.1 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2417.795    197.719  12.228   <2e-16 ***
## JobLevel2          2653.840    107.720  24.636   <2e-16 ***
## JobLevel3          6825.867    152.965  44.624   <2e-16 ***
## JobLevel4         11869.515    255.118  46.525   <2e-16 ***
## JobLevel5         15811.576    290.482  54.432   <2e-16 ***
## Age                   4.548      6.316   0.720   0.4716    
## TotalWorkingYears    29.545     10.871   2.718   0.0067 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1256 on 863 degrees of freedom
## Multiple R-squared:  0.9259, Adjusted R-squared:  0.9254 
## F-statistic:  1797 on 6 and 863 DF,  p-value: < 2.2e-16
##                          2.5 %      97.5 %
## (Intercept)        2029.728420  2805.86250
## JobLevel2          2442.416087  2865.26407
## JobLevel3          6525.639779  7126.09387
## JobLevel4         11368.789547 12370.24015
## JobLevel5         15241.442578 16381.70959
## Age                  -7.847635    16.94390
## TotalWorkingYears     8.209551    50.88143
## [1] 1203.884

## Predict the salary with test_salary_data using the linear model (MonthlyIncome ~ JobLevel + Age + TotalWorkingYears)
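A minimal sketch of the prediction step, assuming `test_salary_data` has the same predictor columns as the training data. The model here is fitted on only the first ten training rows printed in the `str()` output above. The three rows of `test_salary_data` and the name `salary_model` are made up for illustration, so the predicted values are not the project's results.

```r
# First ten training rows of the relevant columns, copied from the str() output.
train <- data.frame(
  JobLevel = factor(c(2, 5, 3, 3, 1, 3, 1, 2, 1, 2)),
  Age = c(32, 40, 35, 32, 24, 27, 41, 37, 34, 34),
  TotalWorkingYears = c(8, 21, 10, 14, 6, 9, 7, 8, 1, 8),
  MonthlyIncome = c(4403, 19626, 9362, 10422, 3760, 8793, 2127, 6694, 2220, 5063)
)

salary_model <- lm(MonthlyIncome ~ JobLevel + Age + TotalWorkingYears, data = train)

# Hypothetical stand-in for the "No Salary" competition set.
# JobLevel must use the same factor levels as the training data.
test_salary_data <- data.frame(
  ID = c(871, 872, 873),
  JobLevel = factor(c(1, 3, 2), levels = levels(train$JobLevel)),
  Age = c(29, 45, 38),
  TotalWorkingYears = c(5, 20, 12)
)

# Predict and attach the estimated salaries.
test_salary_data$MonthlyIncome <- predict(salary_model, newdata = test_salary_data)
print(test_salary_data[, c("ID", "MonthlyIncome")])
```

Reusing the training factor levels for `JobLevel` matters: `predict()` fails if the new data contains a level the model has not seen.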